Today we’ll be working with a dataset that I downloaded from the World Bank’s DataBank. The raw dataset from the DataBank is not amenable for direct analysis, so we will be working with a processed version. For those who are interested, you can find the raw dataset here and the script I used to process the data here.
First, let’s load the ggplot2
package. (Install the package with install.packages("ggplot2")
if you have not done so yet.)
library(ggplot2)
Next, run the following line of code to get the dataset as a data frame in R. Don’t worry about the syntax; treat this as a magical incantation for now.
df <- read.csv("http://web.stanford.edu/~kjytay/courses/stats32-aut2019/Session%203/worldbank_data_tidy.csv",
stringsAsFactors = FALSE)
We use the str()
, head()
and View()
functions to see the dataset:
str(df)
## 'data.frame': 2170 obs. of 11 variables:
## $ cty_name : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ cty_code : chr "AFG" "AFG" "AFG" "AFG" ...
## $ year : int 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 ...
## $ elecAccess: num 45.2 42.7 43.2 69.1 70.2 ...
## $ gdpPerCap : num 1455 1637 1627 1807 1875 ...
## $ compEduc : int 9 9 9 9 9 9 9 9 9 NA ...
## $ educPri : num NA NA NA NA NA NA NA NA NA NA ...
## $ educTer : num NA NA NA NA NA NA NA NA NA NA ...
## $ govEducExp: num NA 3.46 3.44 2.52 3.43 ...
## $ popYoung : num 47.9 47.8 47.3 46.7 46 ...
## $ pop : int 28394813 29185507 30117413 31161376 32269589 33370794 34413603 35383128 36296400 37172386 ...
head(df)
## cty_name cty_code year elecAccess gdpPerCap compEduc educPri educTer
## 1 Afghanistan AFG 2009 45.23712 1454.663 9 NA NA
## 2 Afghanistan AFG 2010 42.70000 1637.378 9 NA NA
## 3 Afghanistan AFG 2011 43.22202 1626.765 9 NA NA
## 4 Afghanistan AFG 2012 69.10000 1806.764 9 NA NA
## 5 Afghanistan AFG 2013 70.15348 1874.766 9 NA NA
## 6 Afghanistan AFG 2014 89.50000 1897.526 9 NA NA
## govEducExp popYoung pop
## 1 NA 47.90947 28394813
## 2 3.46196 47.79811 29185507
## 3 3.43785 47.34328 30117413
## 4 2.52441 46.73910 31161376
## 5 3.43437 46.03378 32269589
## 6 3.67390 45.28739 33370794
Here is a short summary of the variables in the dataset (other than the obvious ones):
elecAccess
: % of population with access to electricity.gdpPerCap
: GDP per capita based on purchasing power parity. This is a proxy for how rich a country is.compEduc
: Number of years that children are legally obliged to attend school.educPri
: % of population aged 25 and over that attained or completed primary education.educTer
: % of population aged 25 and over that attained or completed Bachelor’s or equivalent.govEducExp
: General government expenditure on education as a % of GDP.popYoung
: Population between the ages of 0 to 14 as a % of total population.pop
: Total population.For more details, here is the metadata file that the DataBank produced along with the raw dataset.
Try out some of the other summary functions we learnt last week. How many possible values does compEduc
take on? What years does this dataset span?
Histograms are a good way of understanding the distribution of a single continuous variable. In this dataset, we could be interested in the distribution of GDP per capita one variable of interest that is probably of greatest interest is price. Let’s plot a histogram of price to understand its distribution:
ggplot() +
geom_histogram(data = df, mapping = aes(x = gdpPerCap))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 250 rows containing non-finite values (stat_bin).
Notice that we received a warning message: “Removed 250 rows containing non-finite values (stat_bin).”. There happened to be 250 rows that had NA
as the value in the gdpPerCap
column; R is warning us about it. As a data analyst, it would be a good idea to find out why these rows did not have any data.
If we don’t specify data
or mapping
inside the geom_histogram()
function, it will look for these values in the ggplot()
function. Hence, the code below will give the same result:
ggplot(data = df, mapping = aes(x = gdpPerCap)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 250 rows containing non-finite values (stat_bin).
R defaults to 30 bins for the histogram. We can change this by adding a bins
argument to geom_histogram()
. As you can see from the histograms below, different bin widths can give very different interpretations of the data!
ggplot(data = df) +
geom_histogram(mapping = aes(x = gdpPerCap), bins = 2)
## Warning: Removed 250 rows containing non-finite values (stat_bin).
ggplot(data = df) +
geom_histogram(mapping = aes(x = gdpPerCap), bins = 100)
## Warning: Removed 250 rows containing non-finite values (stat_bin).
ggplot(data = df) +
geom_histogram(mapping = aes(x = gdpPerCap), bins = 1000)
## Warning: Removed 250 rows containing non-finite values (stat_bin).
Scatterplots are good for visualizing the relationship between 2 continuous variables. For example, what’s the relationship between primary education attainment and GDP per capita? Let’s make a scatterplot of GDP per capita vs. primary education attainment:
ggplot(data = df) +
geom_point(mapping = aes(x = educPri, y = gdpPerCap))
## Warning: Removed 1720 rows containing missing values (geom_point).
Wow, what a mess! That’s because we have so many data points being plotted over each other (this is called overplotting). Are there more points in the 70-80% range on the x-axis, or in the 95-100% range? It’s hard to tell. One way to address this is to modify the transparency of each point by adjusting “alpha”. By default, alpha = 1
, which represents being fully opaque. We can reduce alpha (alpha = 0.2
means that 5 points are needed to get full opacity). Notice that alpha
is not within the aes
brackets since it is not determined by the data.
ggplot(data = df) +
geom_point(mapping = aes(x = educPri, y = gdpPerCap), alpha = 0.2)
## Warning: Removed 1720 rows containing missing values (geom_point).
While alpha = 0.2
is probably too faint in this case, the characteristics of the data become more obvious.
We see that the scale on the y-axis is somewhat dominated by the points at the top. One way to make the points more spread out across the canvas is to perform a logarithmic transformation on the y-axis:
ggplot(data = df) +
geom_point(mapping = aes(x = educPri, y = log10(gdpPerCap)), alpha = 0.5)
## Warning: Removed 1720 rows containing missing values (geom_point).
There certainly seems to be a positive relationship here, although the range in GDP per capita seems fairly wide when primary education attainment is very high.
Does the relationship change of we look at tertiary education attainment instead? We make this scatterplot below. Instead of filled circles, we could change the shape of the points manually through the shape
argument (see this reference for which symbols correspond to each shape
value):
ggplot(data = df) +
geom_point(mapping = aes(x = educTer, y = log10(gdpPerCap)), shape = 4)
## Warning: Removed 1902 rows containing missing values (geom_point).
Let’s explore another relationship: years of compulsory education vs. GDP per capita. Let’s start with a scatterplot:
ggplot(data = df) +
geom_point(mapping = aes(x = compEduc, y = log10(gdpPerCap)))
## Warning: Removed 532 rows containing missing values (geom_point).
This is not very informative. We see a lot of overplotting going on: and that’s because compEduc
only takes on a few values. (Because of this, compEduc
can also be thought of as a categorical variable!) Let’s use the alpha trick that we used previously:
ggplot(data = df) +
geom_point(mapping = aes(x = compEduc, y = log10(gdpPerCap)), alpha = 0.2)
## Warning: Removed 532 rows containing missing values (geom_point).
In this case, changing alpha on its own is not going to help much, since all the points are still going to lie on one vertical line. Let’s add jitter (i.e. move the points by a small random amount) to get a better view.
ggplot(data = df) +
geom_point(mapping = aes(x = compEduc, y = log10(gdpPerCap)),
alpha = 0.2, position = "jitter")
## Warning: Removed 532 rows containing missing values (geom_point).
Because jittering is such a common operation, instead of adding position = "jitter"
as an argument to geom_point()
, we can use geom_jitter()
directly to get the same plot:
ggplot(data = df) +
geom_jitter(mapping = aes(x = compEduc, y = log10(gdpPerCap)),
alpha = 0.2)
## Warning: Removed 532 rows containing missing values (geom_point).
Because compEduc
only takes on a few values, we can think of it as a categorical variable. In the plots above, instead of plotting the individual points for each observation, we could draw a boxplot for each value of the categorical variable.
Because compEduc
is a numeric variable, we need to pass factor(compEduc)
to the geom_boxplot()
function instead so that it treats compEduc
like a categorical variable:
ggplot(data = df) +
geom_boxplot(mapping = aes(x = factor(compEduc), y = log10(gdpPerCap)))
## Warning: Removed 250 rows containing non-finite values (stat_boxplot).
For a violin plot, use geom_violin()
:
ggplot(data = df) +
geom_violin(mapping = aes(x = factor(compEduc), y = log10(gdpPerCap)))
## Warning: Removed 250 rows containing non-finite values (stat_ydensity).
Let’s say we want to look at how population changes over the time period that we have in our dataset. A natural first try would be the following:
ggplot(data = df) +
geom_line(aes(x = year, y = log10(pop)))
What’s going on here?? What’s happening is that for each value of year
, there are several rows associated with that value: one for each country! What we really want is one line per country. The way to achieve that is by specifying a group
option:
ggplot(data = df) +
geom_line(aes(x = year, y = log10(pop), group = cty_code), alpha = 0.2)
## Warning: Removed 7 rows containing missing values (geom_path).
The plot shows that population doesn’t change very much on the log scale, as one might expect. Do we see the same trend with the percentage of the population between 0-14 years?
ggplot(data = df) +
geom_line(aes(x = year, y = popYoung, group = cty_code), alpha = 0.2)
## Warning: Removed 237 rows containing missing values (geom_path).
There seems to be a slight downward trend.
These plots are not that useful because we are showing data for all the countries on a single plot: it ends up making the plot look like a mess. It might be better to just look at the trends for a handful of countries of interest.
Is there a relationship between the total population of a country and the relative proportion of the country being < 15 years old? Let’s make a scatterplot to find out:
ggplot(data = df) +
geom_point(aes(x = log10(pop), y = popYoung))
## Warning: Removed 237 rows containing missing values (geom_point).
That’s an interesting looking plot! We see very clear streaks of points, suggesting that they belong to the same group. One way to test this is to color code the points by the country they belong to:
ggplot(data = df) +
geom_point(aes(x = log10(pop), y = popYoung), col = cty_code)
## Error in layer(data = data, mapping = mapping, stat = stat, geom = GeomPoint, : object 'cty_code' not found
What went wrong here? Because color depends on data, we should put it within the aes()
brackets. If col
is outside the brackets, R interprets our intent as changing the color of all points to cty_code
. But the variable cty_code
does not exist in our global environment, hence the error. This is the correct syntax:
ggplot(data = df) +
geom_point(aes(x = log10(pop), y = popYoung, col = cty_code))
## Warning: Removed 237 rows containing missing values (geom_point).
The legend ends up taking up the whole place of the plot… The code below can be used to remove the legend (we’ll learn more about themes next class):
ggplot(data = df) +
geom_point(aes(x = log10(pop), y = popYoung, col = cty_code)) +
theme(legend.position = "none")
## Warning: Removed 237 rows containing missing values (geom_point).
Let’s color the dots by year to see if there is a trend within each streak of dots:
ggplot(data = df) +
geom_point(aes(x = log10(pop), y = popYoung, col = year))
## Warning: Removed 237 rows containing missing values (geom_point).
To save a plot, click on the button, and click “Save as Image…” You can adjust the size of your image in the pop-up before saving it.
sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_3.2.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.2 knitr_1.24 magrittr_1.5 tidyselect_0.2.5
## [5] munsell_0.5.0 colorspace_1.4-1 R6_2.4.0 rlang_0.4.0
## [9] stringr_1.4.0 dplyr_0.8.3 tools_3.6.1 grid_3.6.1
## [13] gtable_0.3.0 xfun_0.9 withr_2.1.2 htmltools_0.3.6
## [17] yaml_2.2.0 lazyeval_0.2.2 digest_0.6.20 assertthat_0.2.1
## [21] tibble_2.1.3 crayon_1.3.4 purrr_0.3.2 glue_1.3.1
## [25] evaluate_0.14 rmarkdown_1.15 labeling_0.3 stringi_1.4.3
## [29] compiler_3.6.1 pillar_1.4.2 scales_1.0.0 pkgconfig_2.0.2